This report investigates the dataset PatientInfo.csv from the Korean research project group Data Science for COVID-19 (DS4C) to provide insights into COVID-19 patient information in Korea. It first presents an initial data analysis, in which the provenance and limitations of the dataset are assessed. The dataset is judged relevant and trustworthy to a certain extent, as it is a reprocessed collection based on official government reports. Domain knowledge on the impact of COVID-19 and Korea’s response to the pandemic is also provided. Missingness is explored and presented to acknowledge the limitations of the dataset.
Two research questions are then investigated: similarities in patient age group distribution across the provinces with the most confirmed cases, and trends in the major patient infection sources over time in Korea. The main findings are that the 20s age group has the highest number of patients, and that trends in infection sources changed in line with local cluster outbreaks and policy measures.
Data Provenance: Content, Management and Use
The dataset PatientInfo.csv is published by Jihoo Kim, chief research director of the Korean research project group DS4C, and was retrieved from the group’s Kaggle dataset page (https://www.kaggle.com/kimjihoo/coronavirusdataset?select=PatientInfo.csv). It is part of a larger collection of datasets on the COVID-19 pandemic in Korea, published under the licence CC BY-NC-SA 4.0. As documented by the research group, patient records are based on official reports released by the government agency Korea Centers for Disease Control and Prevention (KCDC) and by local governments. Documentation of the dataset structure and variable descriptions can be accessed via the research group’s official Kaggle kernel.
Assessment of data and Limitation
The dataset has high relevance and understandability: it provides demographic information on patients, with detailed documentation of the data structure and variables.
The trustworthiness of the dataset has limitations that should be acknowledged. Although the way the dataset was built is documented in detail, the researchers involved were mainly university students rather than professional researchers. However, the dataset is well recognised by the Korean data science community: it is sponsored by industry institutions and cited in other research. It can therefore be regarded as reliable to a certain extent, even though it is not published directly by the government.
As the dataset is reprocessed from government reports, it is also limited by the coverage of the government’s testing and information collection, and hence may not fully represent the actual patient population in Korea.
The COVID-19 global pandemic refers to the spread of an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (Australian Government Department of Health, 2020). The disease was first identified in December 2019 in Wuhan, China. Since then more than 10 million cases have been reported globally, resulting in more than 500,000 deaths, and the pandemic is ongoing at the time of writing.
There is no known effective medical treatment for the disease at the time of writing, which makes the pandemic challenging for governments and communities to handle. Infection can be asymptomatic, meaning that carriers may not show symptoms, and the disease has a high case fatality rate among patients in older age groups (Whiting, 2020).
Korea’s response to the pandemic was cited as a model example in controlling the spread of the disease with extensive testing and control policies (Bendix, 2020).
The dataset has 5165 patient records and 14 variables, listed below.
library(tidyverse)
library(lubridate)
library(naniar)
library(dplyr)
kdata <- read_csv("Patientinfo.csv")
dim(kdata)
## [1] 5165 14
names(kdata)
## [1] "patient_id" "sex" "age"
## [4] "country" "province" "city"
## [7] "infection_case" "infected_by" "contact_number"
## [10] "symptom_onset_date" "confirmed_date" "released_date"
## [13] "deceased_date" "state"
The classes of the variables are listed below. The majority are qualitative variables of class chr, while patient_id and infected_by are of class num. Note that contact_number is read as chr rather than num.
str(kdata)
## tibble [5,165 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ patient_id : num [1:5165] 1.4e+09 1.0e+09 1.0e+09 1.3e+09 1.4e+09 ...
## $ sex : chr [1:5165] "female" "male" "male" "female" ...
## $ age : chr [1:5165] "30s" "50s" "20s" "40s" ...
## $ country : chr [1:5165] "China" "Korea" "Korea" "Korea" ...
## $ province : chr [1:5165] "Incheon" "Seoul" "Seoul" "Gwangju" ...
## $ city : chr [1:5165] "etc" "Gangseo-gu" "Mapo-gu" NA ...
## $ infection_case : chr [1:5165] "overseas inflow" "overseas inflow" "overseas inflow" "overseas inflow" ...
## $ infected_by : num [1:5165] NA NA NA NA NA ...
## $ contact_number : chr [1:5165] NA "75" "9" "450" ...
## $ symptom_onset_date: chr [1:5165] "19/1/2020" "22/1/2020" "26/1/2020" "27/1/2020" ...
## $ confirmed_date : chr [1:5165] "20/1/2020" "23/1/2020" "30/1/2020" "3/2/2020" ...
## $ released_date : chr [1:5165] "6/2/2020" "5/2/2020" "15/2/2020" "20/2/2020" ...
## $ deceased_date : chr [1:5165] NA NA NA NA ...
## $ state : chr [1:5165] "released" "released" "released" "released" ...
## - attr(*, "problems")= tibble [1 × 5] (S3: tbl_df/tbl/data.frame)
## ..$ row : int 3800
## ..$ col : chr "infected_by"
## ..$ expected: chr "no trailing characters"
## ..$ actual : chr ", 1500000055"
## ..$ file : chr "'Patientinfo.csv'"
## - attr(*, "spec")=
## .. cols(
## .. patient_id = col_double(),
## .. sex = col_character(),
## .. age = col_character(),
## .. country = col_character(),
## .. province = col_character(),
## .. city = col_character(),
## .. infection_case = col_character(),
## .. infected_by = col_double(),
## .. contact_number = col_character(),
## .. symptom_onset_date = col_character(),
## .. confirmed_date = col_character(),
## .. released_date = col_character(),
## .. deceased_date = col_character(),
## .. state = col_character()
## .. )
Note that the date variables, such as symptom_onset_date and confirmed_date, are of class chr in the original dataset and follow a day/month/year layout. The Date class is more appropriate for analysis, because it allows chronological ordering and date arithmetic.
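As a minimal sketch of why the Date class helps, the toy values below use the same day/month/year layout as the dataset. The vector names are illustrative, and base R's as.Date() is used here only to keep the example free of package dependencies.

```r
# Toy values in the same day/month/year layout as the dataset's date columns
raw_dates <- c("19/1/2020", "3/2/2020", NA)

# Parse with an explicit format; missing values stay NA
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

class(parsed)            # "Date"
parsed[2] - parsed[1]    # date arithmetic now works (difference in days)
sort(parsed)             # chronological order, NA dropped
```

Sorting the original chr values would order them as text, so "3/2/2020" would come after "19/1/2020" only by coincidence of the first character.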
The majority of patients are recorded with Korea as their country (5123 of 5165 records), so we limit the scope of our investigation to this subset.
# Patient cases by country showing 5123 out of 5165 of patient records are Korean citizens
kdata %>% group_by(country) %>% tally() %>% arrange(desc(n))
## # A tibble: 16 x 2
## country n
## <chr> <int>
## 1 Korea 5123
## 2 China 11
## 3 Foreign 7
## 4 United States 6
## 5 Bangladesh 5
## 6 Indonesia 2
## 7 Thailand 2
## 8 Canada 1
## 9 France 1
## 10 Germany 1
## 11 India 1
## 12 Mongolia 1
## 13 Spain 1
## 14 Switzerland 1
## 15 United Kingdom 1
## 16 Vietnam 1
# This table shows a summary on missing values in the dataset.
miss_var_summary(kdata)
## # A tibble: 14 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 deceased_date 5099 98.7
## 2 symptom_onset_date 4476 86.7
## 3 contact_number 4374 84.7
## 4 infected_by 3820 74.0
## 5 released_date 3578 69.3
## 6 age 1380 26.7
## 7 sex 1122 21.7
## 8 infection_case 919 17.8
## 9 city 94 1.82
## 10 confirmed_date 3 0.0581
## 11 patient_id 0 0
## 12 country 0 0
## 13 province 0 0
## 14 state 0 0
# This is a visualisation of the combined missing values in the dataset.
vis_miss(kdata, warn_large_data = FALSE)
Five variables have missingness above 50%: deceased_date, symptom_onset_date, contact_number, infected_by and released_date. This is likely because not every patient has a record for the corresponding stage. For example, only a small percentage of recorded patients died from the disease, so deceased_date is missing for 98.7% of records because it does not apply to the majority of patients.
Other variables, such as age, sex and infection_case, have moderate missingness of around 18% to 27%. This indicates that the dataset needs to be wrangled and filtered to extract relevant subsets of records for our investigation.
Variables describing patients’ geographic information (country, province and city) and the confirmed_date variable have low missingness, all below 2%.
Overall, the dataset has 34.4% missingness across all variables.
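The overall figure can be checked from the per-variable counts in the summary table. The sketch below copies the n_miss values from that table and divides by the total number of cells. The toy data frame at the end is hypothetical and only illustrates the same calculation on raw data.

```r
# Per-variable missing counts copied from the miss_var_summary() output
n_miss <- c(deceased_date = 5099, symptom_onset_date = 4476,
            contact_number = 4374, infected_by = 3820, released_date = 3578,
            age = 1380, sex = 1122, infection_case = 919, city = 94,
            confirmed_date = 3, patient_id = 0, country = 0,
            province = 0, state = 0)
n_rows <- 5165
n_cols <- length(n_miss)   # 14 variables

# Missing cells as a percentage of all cells (rows x columns)
overall_pct <- 100 * sum(n_miss) / (n_rows * n_cols)
round(overall_pct, 1)      # 34.4

# Same idea applied directly to a data frame (toy example, not the real data)
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, "x"))
100 * mean(is.na(toy))     # 50
```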
We further limit our scope to the three provinces in Korea with the most confirmed cases, which together account for about 71% of the Korean patients with a recorded age (2,644 of 3,747).
# Patient cases by province showing majority of cases are from top 3 provinces
kdata %>% filter(country == 'Korea', is.na(age) == FALSE) %>% group_by(province) %>% tally() %>% arrange(desc(n))
## # A tibble: 17 x 2
## province n
## <chr> <int>
## 1 Gyeongsangbuk-do 1244
## 2 Gyeonggi-do 825
## 3 Seoul 575
## 4 Chungcheongnam-do 166
## 5 Busan 143
## 6 Daegu 131
## 7 Gyeongsangnam-do 129
## 8 Daejeon 119
## 9 Incheon 91
## 10 Gangwon-do 59
## 11 Chungcheongbuk-do 56
## 12 Ulsan 52
## 13 Sejong 51
## 14 Gwangju 44
## 15 Jeollabuk-do 26
## 16 Jeollanam-do 23
## 17 Jeju-do 13
Korea was cited as being effective in controlling the spread through extensive testing regardless of whether symptoms were present (Bendix, 2020). Insights into the age distribution of patients might help researchers or the government to better target testing and treatment.
First, we subset patients with country as Korea and age group recorded.
# !is.na(age) keeps rows where the age variable is not NA
kdata %>%
  filter(country == "Korea", !is.na(age))
## # A tibble: 3,747 x 14
## patient_id sex age country province city infection_case infected_by
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 1000000001 male 50s Korea Seoul Gang… overseas infl… NA
## 2 1000000004 male 20s Korea Seoul Mapo… overseas infl… NA
## 3 1300000001 fema… 40s Korea Gwangju <NA> overseas infl… NA
## 4 1400000003 male 50s Korea Incheon Mich… etc NA
## 5 2000000005 male 40s Korea Gyeongg… Suwo… contact with … 2000000002
## 6 2000000007 fema… 40s Korea Gyeongg… Suwo… contact with … 2000000005
## 7 1000000014 fema… 60s Korea Seoul Jong… contact with … 1000000013
## 8 1000000015 male 70s Korea Seoul Seon… Seongdong-gu … NA
## 9 1000000029 fema… 20s Korea Seoul Jong… Eunpyeong St.… 1000000028
## 10 6001000039 fema… 60s Korea Gyeongs… Gyeo… <NA> NA
## # … with 3,737 more rows, and 6 more variables: contact_number <chr>,
## # symptom_onset_date <chr>, confirmed_date <chr>, released_date <chr>,
## # deceased_date <chr>, state <chr>
Then, to find the provinces with the most cases, we group by province and list the counts in descending order.
# Count and list number of patients in provinces in order
kdata %>% filter(country == 'Korea', is.na(age) == FALSE) %>% group_by(province) %>% tally() %>% arrange(desc(n))
## # A tibble: 17 x 2
## province n
## <chr> <int>
## 1 Gyeongsangbuk-do 1244
## 2 Gyeonggi-do 825
## 3 Seoul 575
## 4 Chungcheongnam-do 166
## 5 Busan 143
## 6 Daegu 131
## 7 Gyeongsangnam-do 129
## 8 Daejeon 119
## 9 Incheon 91
## 10 Gangwon-do 59
## 11 Chungcheongbuk-do 56
## 12 Ulsan 52
## 13 Sejong 51
## 14 Gwangju 44
## 15 Jeollabuk-do 26
## 16 Jeollanam-do 23
## 17 Jeju-do 13
The top 3 provinces with most patients are Gyeongsangbuk-do, Gyeonggi-do and Seoul.
Next, we give the age groups an explicit order by converting the variable from class chr to factor, so that they appear in ascending order of age instead of alphabetical order.
# Change age to factor and add levels to age groups
kdata$age <- factor(kdata$age, levels=c("0s", "10s", "20s", "30s", "40s", "50s", "60s", "70s", "80s", "90s", "100s"))
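The effect of the explicit levels can be seen on a few toy values. This is a small illustrative sketch in base R, and the `ages` vector is hypothetical rather than taken from the dataset.

```r
age_levels <- c("0s", "10s", "20s", "30s", "40s", "50s",
                "60s", "70s", "80s", "90s", "100s")
ages <- c("100s", "20s", "0s", "90s", "20s")   # toy values

# Character sort compares text, so "100s" lands before "20s"
sort(ages)

# Factor sort follows the supplied level order instead
age_f <- factor(ages, levels = age_levels)
sort(age_f)

# Counts are reported in age order, including empty groups
table(age_f)
```

Plot legends and axes in ggplot2 follow the factor levels, which is why the conversion is done before plotting.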
To answer the question, we filter the dataset to the top 3 provinces, and produce a comparative bar plot to show the distribution across these provinces.
# Comparative bar plot showing age group distribution in top provinces
kdata %>%
  filter(country == "Korea", !is.na(age),
         province %in% c("Seoul", "Gyeongsangbuk-do", "Gyeonggi-do")) %>%
  ggplot(aes(province, fill = age)) +
  geom_bar(position = "dodge") +
  scale_fill_discrete("Age Groups") +
  labs(x = "Provinces in Korea", y = "Number of Patients",
       title = "Number of patients by age groups across Korean provinces with most confirmed cases")
The age group distribution is similar across the top provinces: in each, the 20s age group has the largest number of patients.
Follow up investigation: Distribution of number of deceased patients across age groups in Korea
Elderly patients have been observed to have a high case fatality rate (Whiting, 2020). As a follow-up investigation, let us see whether the number of deceased patients in Korea follows this observation.
# Provinces with deceased patients arranged in order
kdata %>% filter(country == 'Korea', is.na(deceased_date) == FALSE, is.na(age) == FALSE) %>% group_by(province) %>% tally() %>% arrange(desc(n))
## # A tibble: 5 x 2
## province n
## <chr> <int>
## 1 Gyeongsangbuk-do 40
## 2 Daegu 20
## 3 Gangwon-do 3
## 4 Daejeon 1
## 5 Ulsan 1
# Comparative bar plot showing deceased patients age group distribution in top 2 provinces
kdata %>%
  filter(country == "Korea", !is.na(deceased_date), !is.na(age),
         province %in% c("Gyeongsangbuk-do", "Daegu")) %>%
  ggplot(aes(province, fill = age)) +
  geom_bar(position = "dodge") +
  scale_fill_discrete("Age Groups") +
  labs(x = "Provinces in Korea", y = "Number of Deceased Patients",
       title = "Number of deceased patients by age group in provinces with most deaths")
The bar plot shows that the 70s and 80s age groups have the highest number of deceased patients, which is consistent with the observation above. Combined with the distribution of confirmed patients, this suggests that younger patients might be carriers of the disease even though they have a lower case fatality rate (Sadler, 2020), and supports extensive testing to control the spread by identifying asymptomatic carriers (Bendix, 2020).
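Counts of deceased patients and a case fatality rate are related but not the same quantity, since a rate divides deaths by the number of cases in each group. The sketch below shows one way the share of deceased patients per age group could be computed. The records are hypothetical toy data, not values from PatientInfo.csv, and the `deceased` flag is an assumed helper derived from deceased_date in the same way the report filters on it.

```r
# Hypothetical toy records, NOT taken from PatientInfo.csv
toy <- data.frame(
  age = c("20s", "20s", "20s", "20s", "70s", "70s", "80s", "80s"),
  deceased_date = as.Date(c(NA, NA, NA, NA, "2020-03-01", NA,
                            "2020-03-05", "2020-03-09"))
)

# A patient counts as deceased when deceased_date is recorded
toy$deceased <- !is.na(toy$deceased_date)

cases  <- table(toy$age)                       # patients per age group
deaths <- tapply(toy$deceased, toy$age, sum)   # deceased patients per age group
share  <- deaths / as.vector(cases)            # crude share of deceased per group
round(share, 2)
```

On the real dataset such a share would still be affected by the missingness in age and by patients whose outcome was not yet known.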
Similarities were found in the age group distributions across the top provinces: the 20s age group has the largest number of confirmed patients, while the 70s and 80s age groups have the largest number of deceased patients.
We limit our scope to the top two infection sources, namely contact with patient and overseas inflow. According to the Korea pandemic timeline, the Shincheonji Church cluster contributed to the first local community wave (Shin, 2020), so we also include it in our visualisation.
Insights might help researchers or government policy makers to analyse or devise pandemic response policies that are effective in controlling infection sources.
# Count and list number of patients by infection_case in order
kdata %>% filter(country == 'Korea', is.na(infection_case) == FALSE) %>% group_by(infection_case) %>% tally() %>% arrange(desc(n))
## # A tibble: 51 x 2
## infection_case n
## <chr> <int>
## 1 contact with patient 1606
## 2 overseas inflow 811
## 3 etc 702
## 4 Itaewon Clubs 162
## 5 Richway 128
## 6 Guro-gu Call Center 112
## 7 Shincheonji Church 106
## 8 Coupang Logistics Center 80
## 9 Yangcheon Table Tennis Club 44
## 10 Day Care Center 43
## # … with 41 more rows
Firstly, we convert the date variables from chr to the more appropriate Date class.
# Change format from chr to Date
kdata <- kdata %>%
  mutate(symptom_onset_date = dmy(symptom_onset_date),
         confirmed_date = dmy(confirmed_date),
         released_date = dmy(released_date),
         deceased_date = dmy(deceased_date))
Then, for cleaner visualisation, we create a new variable Month to group cases by their recorded month in confirmed_date, and order Months in chronological instead of alphabetical order.
# Create new variable Month using confirmed_date
kdata <- kdata %>%
mutate(`Month` = month(`confirmed_date`))
# Add order to Month
kdata <- kdata %>%
mutate(Month = factor(month.name[Month], levels = month.name))
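The two steps above can be illustrated on a few toy dates. This sketch uses base R's format() as a stand-in for lubridate's month(), and the `confirmed` vector is hypothetical.

```r
# Toy dates, including a missing value as in confirmed_date
confirmed <- as.Date(c("2020-03-15", "2020-01-20", "2020-06-01", NA))

# Month number (1 to 12); base R stand-in for lubridate::month()
month_num <- as.integer(format(confirmed, "%m"))

# month.name is a built-in vector of English month names in calendar order,
# so using it for both labels and levels gives a chronological factor
Month <- factor(month.name[month_num], levels = month.name)

sort(Month)          # January, March, June (NA dropped)
as.integer(Month)    # 3 1 6 NA, the position of each month in the level order
```

Without the levels argument the factor would be ordered alphabetically, placing April before January on the plot axis.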
Lastly, we filter patients from the selected infection sources, count the cases for each source in each month, and store the result in a new data frame.
# Data frame storing monthly counts for the selected infection_case values
# (%in% already drops rows where infection_case is NA)
count <- kdata %>%
  filter(country == "Korea",
         infection_case %in% c("contact with patient", "overseas inflow", "Shincheonji Church")) %>%
  select(confirmed_date, Month, infection_case) %>%
  group_by(Month, infection_case) %>%
  tally()
count
## # A tibble: 14 x 3
## # Groups: Month [6]
## Month infection_case n
## <fct> <chr> <int>
## 1 January contact with patient 4
## 2 January overseas inflow 6
## 3 February contact with patient 199
## 4 February overseas inflow 14
## 5 February Shincheonji Church 82
## 6 March contact with patient 567
## 7 March overseas inflow 322
## 8 March Shincheonji Church 24
## 9 April contact with patient 193
## 10 April overseas inflow 245
## 11 May contact with patient 210
## 12 May overseas inflow 96
## 13 June contact with patient 433
## 14 June overseas inflow 128
To answer the question, the line plot below shows trends in the number of patients from the different infection sources. Contact with patient cases peaked at 567 in March, the month after the Shincheonji Church cluster appeared in February, which is consistent with the cluster contributing to that peak.
# Line plot showing trends in number of patients from major infection sources over time
count %>%
  ggplot(aes(x = Month, y = n, group = infection_case, color = infection_case)) +
  geom_line() +
  geom_point() +
  labs(x = "Months", y = "Number of Patients",
       title = "Number of patients by major infection cases over time in Korea")
# Numbers in February and March for the selected infection sources
# count is already tallied, so filtering is enough; calling tally() again
# would only reproduce n where dplyr treats an existing n column as a weight
count %>%
  filter(Month %in% c("February", "March"))
## # A tibble: 6 x 3
## # Groups: Month [2]
## Month infection_case n
## <fct> <chr> <int>
## 1 February contact with patient 199
## 2 February overseas inflow 14
## 3 February Shincheonji Church 82
## 4 March contact with patient 567
## 5 March overseas inflow 322
## 6 March Shincheonji Church 24
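Both contact with patient and overseas inflow peaked in March and were lower afterwards. Contact with patient fell to 193 cases in April and stayed near that level in May (210), while overseas inflow declined through April (245) and May (96). Both rose again in June, to 433 and 128 cases respectively.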
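The month-over-month movement can be checked directly from the tallied counts. The sketch below copies the monthly values from the output above into plain vectors (the vector names are illustrative) and uses base R's diff() to compute the change from the previous month.

```r
months <- c("January", "February", "March", "April", "May", "June")

# Monthly counts copied from the tallied output above
contact  <- c(4, 199, 567, 193, 210, 433)   # contact with patient
overseas <- c(6, 14, 322, 245, 96, 128)     # overseas inflow

# Month with the highest count for each source
months[which.max(contact)]    # "March"
months[which.max(overseas)]   # "March"

# Change from the previous month; positive values mean an increase
data.frame(month = months[-1],
           contact_change  = diff(contact),
           overseas_change = diff(overseas))
```

The changes show that contact with patient rose slightly from April to May (by 17 cases) before the larger increase in June, while overseas inflow kept falling until May.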
The Shincheonji Church cluster was reported as contributing to local community spread (Shin, 2020), and it accounts for 82 cases in February as a recorded source. The graph shows that contact with patient cases then peaked at 567 in March, after the cluster appeared in February.
According to the Korea pandemic timeline (Cha, 2020), the decrease in the two infection sources after March is likely due to global travel bans and local social distancing measures, which appear to have been effective as case numbers fell. The increase in June follows the easing of restrictions in late May (Jones, 2020).
The top two patient infection sources, contact with patient and overseas inflow, peaked in March, were lower in April and May, and increased again in June.
Data wrangling was useful for exploring missingness and for manipulating, reshaping and visualising the dataset.
R packages used for data wrangling
Summarising and visualising missingness in the initial data exploration helped me make more conscious decisions when choosing variables of interest for the research questions, based on the relevance of the dataset. In investigating the research questions, I used the age group, confirmed_date and location variables to subset and mutate the dataset, filtering patient records by country, excluding records with NA values, and grouping them by age group or province.
Since the dataset I have chosen is composed mainly of qualitative variables, dplyr’s tally() was useful for counting observations filtered by conditions. This gave me numerical summaries for identifying the most significant categories by number of observations, such as the provinces with the most patients or the top infection sources, and for generating visualisations that combine several variables. Converting variables was also useful for ordering categories such as age group and Month. These were originally of class chr or derived from chr variables, and converting them to factors makes the dataset more logical and cleaner for analysis and visualisation.
Access to dataset
Kim, J. (2020, July 01). Data Science for COVID-19 (DS4C) in Korea. Retrieved July 02, 2020, from https://www.kaggle.com/kimjihoo/coronavirusdataset?select=PatientInfo.csv
Lee, J. (n.d.). DS4C (Data Science for COVID-19) Project. Retrieved July 02, 2020, from https://github.com/ThisIsIsaac/Data-Science-for-COVID-19
Domain knowledge
Australian Government Department of Health. (2020, July 02). What you need to know about coronavirus (COVID-19). Retrieved July 06, 2020, from https://www.health.gov.au/news/health-alerts/novel-coronavirus-2019-ncov-health-alert/what-you-need-to-know-about-coronavirus-covid-19
Media articles for research question 1
Bendix, A. (2020, March 6). South Korea has tested 140,000 people for the coronavirus. That could explain why its death rate is just 0.6% — far lower than in China or the US. Retrieved July 06, 2020, from https://www.msn.com/en-au/news/other/south-korea-has-tested-140000-people-for-the-coronavirus-that-could-explain-why-its-death-rate-is-just-06-25-e2-80-94-far-lower-than-in-china-or-the-us/ar-BB10OyZU
Sadler, R. (2020, March 16). Coronavirus: New graph shows people in their 20s are more asymptomatic and not being tested for COVID-19. Retrieved July 06, 2020, from https://www.newshub.co.nz/home/world/2020/03/coronavirus-new-graph-shows-people-in-their-20s-are-more-asymptomatic-and-not-being-tested-for-covid-19.html
Whiting, K. (2020, March 12). An expert explains: How to help older people through the COVID-19 pandemic. Retrieved July 06, 2020, from https://www.weforum.org/agenda/2020/03/coronavirus-covid-19-elderly-older-people-health-risk/
Media articles for research question 2
Cha, V., Kim, D. (2020, March 27). A Timeline of South Korea’s Response to COVID-19. Retrieved July 06, 2020, from https://www.csis.org/analysis/timeline-south-koreas-response-covid-19
Jones, S., Anderson, C. (2020, June 23). Global report: South Korea has Covid-19 second wave as Israel ponders new lockdown. Retrieved July 06, 2020, from https://www.theguardian.com/world/2020/jun/22/coronavirus-global-report-new-covid-19-cases-surge-south-korea-israel
Shin, Y., Berkowitz, B., Kim, M. (2020, March 25). How a South Korean church helped fuel the spread of the coronavirus. Retrieved July 06, 2020, from https://www.washingtonpost.com/graphics/2020/world/coronavirus-south-korea-church/?itid=ap_youjinshin